
Lock-free and wait-free algorithms
Okay, buckle up. You're stepping off the well-trodden path of mutexes and semaphores and diving into the high-stakes world of non-blocking concurrent programming. This is where performance bottlenecks are surgically removed, and system resilience gets a serious upgrade – if you can master the techniques. Welcome to the chapter on Lock-Free and Wait-Free algorithms, a cornerstone of "The Forbidden Code."
They don't push these concepts in introductory courses because they're complex, tricky to get right, and demand a deep understanding of hardware architecture and memory models. But in scenarios craving ultimate performance, guaranteed responsiveness, or bulletproof resistance to classic synchronization problems, these methods are indispensable.
The Forbidden Code: Diving Deep into Non-Blocking Concurrency
Chapter: Lock-Free and Wait-Free Algorithms
Introduction: The Problem with Locks
In the realm of concurrent programming, where multiple threads or processes access shared resources, synchronization is paramount. The standard approach involves using locks (like mutexes or semaphores) to ensure that only one thread can access a critical section of code at a time. This prevents data corruption. However, this conventional wisdom comes with significant drawbacks that can cripple performance and introduce insidious bugs in complex systems.
Let's revisit the fundamental issues with lock-based synchronization:
- Blocking: When a thread tries to acquire a lock that is already held by another thread, it blocks. This means the thread pauses execution, potentially for an indefinite amount of time, until the lock is released. In a multi-core system, this is wasted CPU time.
- Contention: When many threads simultaneously try to acquire the same lock, they create contention. The operating system or runtime has to manage these competing requests, often leading to increased overhead from context switching as it suspends and resumes threads.
- Deadlock: A classic and hard-to-debug problem where two or more threads are blocked indefinitely, each waiting for the other to release a resource (often a lock). Imagine Thread A holds Lock X and needs Lock Y, while Thread B holds Lock Y and needs Lock X. Neither can proceed.
- Starvation: A situation where a thread is perpetually blocked from accessing a resource it needs, even though the resource becomes available at times. This can happen due to unfair scheduling or priority issues (e.g., a high-priority thread constantly acquires a lock before a low-priority thread gets a chance).
- Priority Inversion: A critical issue in real-time systems. A high-priority thread needs a resource held by a low-priority thread, so it blocks. If a medium-priority thread then preempts the low-priority thread, the low-priority thread never gets to run and release the resource, and the high-priority thread ends up waiting behind the medium-priority thread, effectively inverting their priorities.
These issues demonstrate that while simple locks are easy to grasp initially, they can introduce fragility and performance bottlenecks, especially under high contention or in systems requiring guaranteed response times. This is where the techniques they don't always teach you come in: Non-blocking algorithms.
Non-Blocking Synchronization: A Different Paradigm
Instead of preventing access via mutual exclusion (locking), non-blocking algorithms allow multiple threads to attempt to access and modify shared data simultaneously. They rely on low-level atomic operations provided by the hardware to detect interference from other threads and, if interference occurs, retry the operation.
The core idea is that if an operation fails due to concurrent modification, the thread doesn't block. It simply tries again, often with slightly updated information. This "optimistic" approach assumes conflicts are relatively rare or handles them efficiently when they occur.
Non-blocking algorithms offer stronger guarantees than lock-based ones regarding system-wide progress and, in some cases, per-thread progress. These guarantees lead to two primary categories: Lock-Free and Wait-Free.
Definition: Atomic Operation An operation that is guaranteed to complete without being interrupted by other threads or processes. From the perspective of other threads, an atomic operation happens instantaneously; they either see the state before the operation or the state after the operation, never an intermediate state. Modern CPUs provide instructions for atomic reads, writes, increments, and complex operations like Compare-and-Swap (CAS).
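To ground the definition, here is a minimal sketch of these operations using C++'s standard `std::atomic` (the variable and function names are illustrative, not from any particular library):

```cpp
#include <atomic>

std::atomic<int> shared{0};

void demo_atomics() {
    int v = shared.load();   // Atomic read
    shared.store(v + 1);     // Atomic write -- but note that load+store
                             // is NOT an atomic increment; another thread
                             // can slip in between the two calls
    shared.fetch_add(1);     // A true atomic increment, in one step
    int expected = 2;
    // CAS: write 3 only if shared still holds 2; on failure, 'expected'
    // is overwritten with the value actually observed
    shared.compare_exchange_strong(expected, 3);
}
```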
Understanding Lock-Free Algorithms
Lock-free is the less strict, and more common, of the two stronger non-blocking guarantees.
Definition: Lock-Free Progress Property A system composed of multiple threads is lock-free if, whenever the system as a whole is running, at least one thread is guaranteed to make progress toward completing its operation, even if other threads are arbitrarily delayed or descheduled. No global deadlock is possible, and the system as a whole cannot be stalled by any single thread failing or being suspended.
Explanation:
Think of a group of people trying to update a single shared online document without locking it for exclusive editing. Everyone downloads a copy, makes changes, and tries to upload. If someone uploads their changes between you downloading and you attempting to upload, your upload attempt fails (because the document has changed based on your old version). You don't wait; you just download the new version and try again.
In a lock-free algorithm:
- Multiple threads might attempt an operation (e.g., adding an item to a queue).
- They use atomic operations (like CAS) to try and make their change.
- If the atomic operation fails (meaning another thread modified the data concurrently), the thread retries the operation, typically in a loop, until it succeeds.
- While an individual thread might repeatedly fail and retry (potentially experiencing starvation), the guarantee is that some thread in the system will succeed in its operation, ensuring the system as a whole makes progress.
Example Scenario: Imagine a lock-free counter. Multiple threads want to increment a shared integer `count`. A lock-based approach would involve `lock(); count++; unlock();`. A lock-free approach using CAS might look like this:
```cpp
#include <atomic>

std::atomic<int> count{0}; // Uses underlying hardware atomic operations

void increment_lock_free() {
    int current_value;
    int new_value;
    do {
        current_value = count.load(); // Atomically read the current value
        new_value = current_value + 1;
        // Atomically: if count still equals current_value, set it to
        // new_value. If another thread changed it in the meantime, this
        // returns false and the loop retries with a fresh read.
    } while (!count.compare_exchange_weak(current_value, new_value));
}
```
In this example, if two threads try to increment `count` simultaneously, only one `compare_exchange_weak` operation can succeed for a given `current_value`. The thread whose CAS fails will reload the new `count` value and retry. At least one of them will succeed on any given pass through the contention.
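A minimal driver, assuming the `count` and `increment_lock_free` definitions from the listing above, demonstrates that no increments are lost:

```cpp
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([] {
            for (int i = 0; i < 100000; ++i)
                increment_lock_free(); // From the listing above
        });
    for (auto& w : workers) w.join();
    std::printf("count = %d\n", count.load()); // Prints 400000
}
```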
Benefits of Lock-Free:
- Avoids deadlocks.
- Avoids priority inversion.
- Can provide higher throughput under high contention compared to locks, as threads don't block the entire system.
- Ensures system-wide progress.
Drawbacks of Lock-Free:
- Individual threads can still starve (be delayed indefinitely by others).
- Algorithms are significantly more complex to design and implement correctly.
- Debugging is notoriously difficult.
- Relies heavily on specific hardware atomic instructions.
- Retry loops can waste CPU under constant conflict: individual threads may spin through repeated failed attempts (a livelock-like pattern for those threads), though true system-wide livelock is ruled out, since each failed CAS implies some other thread's CAS succeeded.
Understanding Wait-Free Algorithms
Wait-free is the strongest non-blocking guarantee. It builds upon lock-free by eliminating the possibility of starvation.
Definition: Wait-Free Progress Property A system or an operation is wait-free if every thread attempting an operation is guaranteed to complete it within a finite number of its own steps, regardless of the execution speed or failures of other threads (except for a thread crashing mid-operation, which might affect the operation but doesn't halt others).
Explanation:
Going back to the shared document analogy: in a wait-free system, there's a mechanism (perhaps a round-robin scheme managed with atomic operations) that ensures everyone trying to upload their changes will succeed within a predictable number of their own tries, no matter how busy or slow others are.
In a wait-free algorithm:
- Not only is system-wide progress guaranteed (like lock-free), but each individual thread is guaranteed to complete its operation within a bounded number of steps.
- This bound is independent of the behavior of other threads; it only depends on the algorithm itself and the specific thread's execution.
- Starvation is impossible by definition.
Example Scenario: Implementing a wait-free queue is significantly more complex than a lock-free one. A simple CAS-based push/pop can lead to starvation. Wait-free data structures often involve techniques where threads help complete operations started by other threads, or they use more sophisticated primitives such as multi-word atomic operations (if available). A simple wait-free counter might use Fetch-and-Add (FAA) if the hardware supports it atomically: `FAA(&count, 1)` is wait-free because it completes in a single atomic step for any thread.
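A minimal sketch of such a counter in C++ follows. One caveat: this is wait-free when `fetch_add` compiles to a native atomic add (e.g., x86's LOCK XADD); on LL/SC architectures the compiler may emit a retry loop, making it lock-free rather than strictly wait-free.

```cpp
#include <atomic>

std::atomic<int> count{0};

// Completes in one bounded step for the calling thread when the
// hardware provides a native atomic add; no retry loop is needed.
int increment_wait_free() {
    return count.fetch_add(1); // Returns the value before the add
}
```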
Benefits of Wait-Free:
- Strongest non-blocking guarantee: eliminates starvation.
- Guarantees per-thread progress within a bounded number of steps, making them suitable for strict real-time systems.
- Avoids deadlocks and priority inversion.
Drawbacks of Wait-Free:
- Extremely difficult to design and implement.
- Can have higher overhead than lock-free or even lock-based approaches in low-contention scenarios due to the complexity required to guarantee finite steps for all threads.
- Fewer known general-purpose wait-free data structures compared to lock-free ones.
- Often requires advanced hardware support or complex software techniques (like helping).
Comparing Lock-Free and Wait-Free
Feature | Lock-Based (Mutex) | Lock-Free | Wait-Free |
---|---|---|---|
System Progress | No (subject to deadlock/stall) | Yes (at least one thread) | Yes (all threads) |
Per-Thread Progress | No (subject to starvation) | No (subject to starvation) | Yes (finite steps) |
Deadlock | Possible | Impossible | Impossible |
Starvation | Possible | Possible | Impossible |
Priority Inversion | Possible | Impossible | Impossible |
Complexity | Simple for basic use | High | Very High |
Use Cases | General purpose synchronization | High-contention, libraries | Real-time systems, high reliability |
Overhead (Low Contention) | Low | Moderate (due to retries/atomics) | High (due to complexity/helping) |
Overhead (High Contention) | High (due to blocking/contention) | Low to Moderate | Moderate to High |
Wait-free algorithms are a subset of lock-free algorithms. If an algorithm is wait-free, it is also lock-free, but the reverse is not true.
The Foundation: Atomic Primitives
Non-blocking algorithms are built upon the bedrock of atomic operations provided by the hardware. These are CPU instructions that complete in a single, uninterruptible step. Without these, building reliable non-blocking algorithms would be impossible.
The most fundamental and widely used atomic primitive for building complex non-blocking algorithms is:
Definition: Compare-and-Swap (CAS) An atomic operation that takes three operands: a memory location `M`, an `expected_value`, and a `new_value`. It performs the following action atomically:

- Read the current value at `M`.
- If the current value equals `expected_value`, write `new_value` to `M`.
- Return `true` if the write occurred (the values matched), otherwise return `false`.

Symbolically: `bool CAS(M, expected_value, new_value)`
How CAS is Used:
CAS is the basis for optimistic concurrency. A thread reads a shared value, computes a new value based on the old one, and then uses CAS to try and update the shared location only if it hasn't changed since the original read.
```cpp
#include <atomic>

struct Node { int value; Node* next; };

std::atomic<Node*> head{nullptr}; // Top of a lock-free stack

// Push new_node onto the stack using a CAS retry loop
void push(Node* new_node) {
    Node* old_head;
    do {
        old_head = head.load();    // Atomically read the current head
        new_node->next = old_head; // Link the new node to the old head
        // Attempt to set head to new_node ONLY IF it is still old_head
    } while (!head.compare_exchange_weak(old_head, new_node)); // Retry if CAS fails
}
```
If `compare_exchange_weak` returns `false`, it means `head` was modified by another thread between the `load()` and the CAS attempt. The loop retries, reloading the new `head` value and trying the link-and-CAS again. (On failure, `compare_exchange_weak` also writes the observed value back into `old_head`, which is why some implementations skip the explicit reload.)
Other Atomic Primitives:
While CAS is king, other atomic operations exist and can sometimes simplify specific non-blocking operations:
- Fetch-and-Add (FAA): Atomically reads a value, adds a constant to it, and writes the result back. Returns the original value. (Used for counters, sequence numbers).
- Load-Linked/Store-Conditional (LL/SC): A pair of instructions. LL reads a value and "links" the memory location. SC attempts to write to the linked location; it succeeds only if the location hasn't been written to since the LL. If it succeeds, it breaks the link. Often more flexible than CAS for complex multi-word operations, but less common than CAS across architectures.
- Atomic Exchange: Atomically writes a new value to a location and returns the old value.
These primitives are typically exposed in modern programming languages through atomic libraries (e.g., `std::atomic` in C++, `java.util.concurrent.atomic` in Java).
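A short sketch of these primitives via C++'s `std::atomic` (the names are illustrative; note that `compare_exchange_weak` may fail spuriously on LL/SC architectures such as ARM and POWER, which is why it normally sits in a loop):

```cpp
#include <atomic>

std::atomic<long> seq{0};
std::atomic<int>  flag{0};

void demo_primitives() {
    // Fetch-and-Add: returns the value before the add
    // (handy for counters, tickets, and sequence numbers).
    long my_ticket = seq.fetch_add(1);

    // Atomic Exchange: store a new value, get the old one back.
    int previous = flag.exchange(1);

    // CAS: succeeds only if seq still holds 'expected'; on failure,
    // 'expected' is overwritten with the value actually observed.
    long expected = my_ticket + 1;
    seq.compare_exchange_strong(expected, expected + previous);
}
```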
Implementing Non-Blocking Data Structures
Building lock-free or wait-free data structures is where the real "Forbidden Code" challenge lies. Simple operations like counters are relatively straightforward with atomic primitives. Complex structures like stacks, queues, lists, and hash maps are significantly harder.
The general pattern involves loops using atomic operations (often CAS) to attempt modifications. The difficulty arises in managing the state of the data structure such that any number of threads can simultaneously attempt operations without corrupting the structure, and ensuring the desired progress property (lock-free or wait-free) is met.
Key Challenges in Implementation:
The ABA Problem: This is a classic pitfall when using CAS with reusable memory locations (like pointers).

- Thread A reads a value (e.g., a pointer `P`) which points to node A.
- Thread B removes node A, performs some operations, and eventually allocates a new node C at the same memory address that node A used to occupy. `P` now holds the same address as when Thread A read it, but that address belongs to node C.
- Thread A attempts a CAS operation expecting `P` to still hold the old address. Since the value is the same, the CAS succeeds, but Thread A is now operating on node C, incorrectly assuming it is still node A. This can corrupt the data structure.
Context: The ABA Problem A subtle race condition in lock-free algorithms using CAS where a memory location's value changes from A to B and then back to A between a thread's initial read and its CAS attempt. Because the value is 'A' again, the CAS succeeds even though the underlying state it represents has changed, potentially leading to algorithm failure or data corruption.
- Solutions for ABA: The most common solution is to add a version counter or tag to the shared value (e.g., `pair<Pointer, int>`). The atomic operation (often a double-width CAS, like DCAS or CAS2, if supported by hardware/software) must compare both the pointer and the version counter. If the address changes from A to B and back to A, the version counter will have incremented (e.g., A-v1 -> B-v2 -> A-v3), and the CAS will correctly fail because the version does not match. A sketch of this defence appears below.
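A minimal sketch of the versioned-pointer defence, assuming a 64-bit platform where `std::atomic` of a 16-byte struct may be lock-free (check `is_lock_free()` at runtime; this is platform-dependent). Note that the `->next` dereference below still requires safe memory reclamation, covered next:

```cpp
#include <atomic>
#include <cstdint>

struct Node { int value; Node* next; };

// Pointer plus version counter. The CAS compares both fields, so an
// A -> B -> A address sequence is caught by the changed version.
struct TaggedPtr {
    Node*          ptr;
    std::uintptr_t version;
};

std::atomic<TaggedPtr> top{TaggedPtr{nullptr, 0}};

Node* pop() {
    TaggedPtr old_top = top.load();
    TaggedPtr new_top;
    do {
        if (old_top.ptr == nullptr) return nullptr; // Stack is empty
        // WARNING: this dereference is unsafe without SMR; sketch only.
        new_top.ptr = old_top.ptr->next;
        new_top.version = old_top.version + 1; // Bump version on every change
        // On failure, old_top is refreshed with the observed value.
    } while (!top.compare_exchange_weak(old_top, new_top));
    return old_top.ptr;
}
```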
Memory Reclamation: In lock-free data structures, a thread might remove a node, but another thread might still hold a pointer to that node (e.g., from a read operation that occurred before the removal, but the thread hasn't finished its operation yet). Freeing the node's memory immediately is unsafe, as the other thread might try to access freed memory (a use-after-free bug).
Context: Safe Memory Reclamation (SMR) Techniques used in non-blocking algorithms to determine when memory occupied by removed nodes can be safely freed. Since threads might hold pointers to "logically" removed nodes, SMR methods ensure that no thread is currently accessing a node before its memory is reclaimed. Common techniques include Hazard Pointers, RCU (Read-Copy-Update), and epoch-based reclamation.
- Solutions for Memory Reclamation: These are complex systems in themselves. Techniques like Hazard Pointers allow threads to register pointers they are currently "using," preventing those nodes from being freed until the thread unregistered them. RCU is another sophisticated technique often used in operating systems for safe reading while updates occur. Implementing or even using these SMR systems correctly adds significant complexity.
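As a taste of what SMR involves, here is a deliberately minimal hazard-pointer sketch (fixed thread count, one hazard slot per thread; production libraries are far more elaborate, but the core handshake is this):

```cpp
#include <atomic>

constexpr int kMaxThreads = 8;
std::atomic<void*> hazard_slot[kMaxThreads]; // Zero-initialized to nullptr

// Reader side: publish a pointer before dereferencing it.
void* protect(int tid, std::atomic<void*>& src) {
    void* p;
    do {
        p = src.load();
        hazard_slot[tid].store(p);  // Announce "I am using p"
    } while (src.load() != p);      // Re-check: src may have moved on
    return p;                       // Safe to use until the slot is cleared
}

// Reader side: done with the pointer.
void release(int tid) { hazard_slot[tid].store(nullptr); }

// Reclaimer side: a retired node may be freed only if no thread
// has it published in a hazard slot.
bool can_reclaim(void* p) {
    for (int i = 0; i < kMaxThreads; ++i)
        if (hazard_slot[i].load() == p)
            return false; // Still protected; keep it on the retired list
    return true;          // No reader holds it; safe to free
}
```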
When to Choose (and Avoid) Non-Blocking Techniques
These techniques are powerful but not a silver bullet. Knowing when to deploy them is crucial.
Use Non-Blocking (Lock-Free or Wait-Free) When:
- High Contention is Expected: When many threads frequently access and modify the same shared data structure. Under high contention, locks can become a serious bottleneck as threads queue up and context switch. Non-blocking algorithms can allow threads to make progress via retries.
- Avoiding Deadlock/Starvation is Critical: In systems where these issues are unacceptable, such as real-time systems (where wait-free is often required) or highly fault-tolerant systems.
- Implementing Low-Level Libraries: Standard library concurrency primitives or OS kernels often use non-blocking techniques internally for performance and robustness.
- Specific Data Structures Map Well: Some data structures (like simple queues or stacks) have relatively well-understood lock-free implementations.
Avoid Non-Blocking Techniques When:
- Synchronization Needs are Simple: For basic synchronization (e.g., protecting a counter accessed rarely), a simple mutex is usually much easier, less error-prone, and the overhead is negligible.
- Contention is Low: If threads rarely access the shared resource simultaneously, the overhead of atomic operations, retry loops, and complex memory management in non-blocking algorithms can be higher than the cost of a simple lock acquisition.
- Complexity is a Major Concern: Non-blocking algorithms are inherently more complex to design, implement, and especially debug. Tracing issues in a lock-free algorithm is notoriously difficult, as race conditions manifest subtly and non-deterministically across retries.
- Memory Management is Not Addressed: Building non-blocking data structures without a solid plan for safe memory reclamation is a recipe for disaster (use-after-free bugs).
Conclusion: Unleashing the Power (Responsibly)
Lock-free and wait-free algorithms are undoubtedly part of the "Forbidden Code" – they are advanced, require a deeper technical understanding than conventional locking, and demand meticulous attention to detail.
They offer compelling advantages in specific scenarios: eliminating deadlocks and priority inversion, providing stronger progress guarantees, and potentially delivering higher performance under extreme contention. Wait-free algorithms are particularly valuable for real-time systems where bounded execution time is a hard requirement.
However, their complexity, the challenges of the ABA problem and safe memory reclamation, and the difficulty of debugging mean they should not be the default choice. Use them when the problems they solve (high contention bottlenecks, strict real-time constraints, avoiding deadlock/starvation) are actual, measured issues that cannot be adequately addressed with simpler techniques.
Mastering lock-free and wait-free programming requires not just understanding the concepts but also gaining practical experience with atomic operations, careful algorithm design, and robust memory management techniques. It's a challenging but rewarding area for those looking to push the boundaries of concurrent system performance and reliability.